Analyzing Features for the Detection of Happy Endings in German Novels
Abstract
With regard to a computational representation of literary plot, this paper looks at the use of sentiment analysis for happy ending detection in German novels. Its focus lies on the investigation of previously proposed sentiment features in order to gain insight into the relevance of specific features on the one hand and the implications of their performance on the other hand. To this end, we study various partitionings of novels, considering the highly variable concept of "ending". We also show that our approach, even though still rather simple, can potentially lead to substantial findings relevant to literary studies.
Introduction
Plot is fundamental for the structure of literary works. Methods for the computational representation of plot or special plot elements would therefore be a great achievement for digital literary studies. This paper looks at one such element: happy endings. We employ sentiment analysis for the detection of happy endings, but focus on a qualitative analysis of specific features and their performance in order to gain deeper insight into the automatic classification. In addition, we show how the applied method can be used for subsequent research questions, yielding interesting results with regard to publishing periods of the novels.
Related Work
One of the first works in this direction addressed folkloristic tales: Mark Finlayson created an algorithm capable of detecting events and higher-level abstractions, such as villainy or reward (Finlayson 2012). Reiter et al., also working on tales, identify events, their participants and their order, and use machine learning methods to find structural similarities across texts (Reiter 2013, Reiter et al. 2014). Recently, sentiment analysis has received a significant amount of attention, since Matthew Jockers proposed emotional arousal as a new “method for detecting plot” (Jockers 2014). He described his idea of splitting novels into segments and using those to form plot trajectories (Jockers 2015).
Despite general acceptance of the idea of employing sentiment analysis, his use of the Fourier transform to smooth the resulting plot curves was criticized (Swafford 2015, Schmidt 2015). Among other features, Micha Elsner (Elsner 2015) builds plot representations of romantic novels, again by using sentiment trajectories. He also links such trajectories with specific characters and looks at character co-occurrences. To evaluate his approach, he distinguishes real novels from artificially reordered surrogates with considerable success, showing that his methods indeed capture certain aspects of plot structure. In previous work, we used sentiment features to detect happy endings as a major plot element in German novels, reaching an F1-score of 73% (Zehe et al. 2016).
Corpus and Resources
Our dataset consists of 212 novels in German, mostly from the 19th century [1]. Each novel has been manually annotated as either having a happy ending (50%) or not (50%). The relevant information has been obtained from summaries of the Kindler Literary Lexikon Online [2] and Wikipedia. If no summary was available, the corresponding parts of the novel have been read by the annotators.
Sentiment analysis requires a resource which lists sentiment values that human readers typically associate with certain words or phrases in a text. This paper relies on the NRC Sentiment Lexicon (Mohammad and Turney 2013), which is available in an automatically translated German version [3]. A notable feature of this lexicon is that, besides specifying binary values (0 or 1) for negative and positive connotations (2 features), it also categorizes words into 8 basic emotions (anger, fear, disgust, surprise, joy, anticipation, trust and sadness); see Table 1 for an example. We add another value (the polarity) by subtracting the negative from the positive value (e.g. a word with a positive value of 0 and a negative value of 1 has a polarity value of -1).
The polarity serves as an overall sentiment score, bringing the total to 11 features per word.
Table 1: Example entries from the NRC Sentiment Lexicon
Word/Dimension   verabscheuen (to detest)   bewundernswert (admirable)   Zufall (coincidence)
Positive          0                          1                            0
Negative          1                          0                            0
Polarity         -1                          1                            0
Anger             1                          0                            0
Anticipation      0                          0                            0
Disgust           1                          0                            0
Fear              1                          0                            0
Joy               0                          1                            0
Sadness           0                          0                            0
Surprise          0                          0                            1
Trust             0                          1                            0
[1] Source: https://textgrid.de/digitale-bibliothek
[2] www.kll-online.de
[3] http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
Experiments
The goal of this paper is to investigate features that have been used for the detection of happy endings in novels in order to gain insight into the relevance of specific feature sets on the one hand and the implications of their performance on the other hand. To that end, we adopt the features and methods presented in Zehe et al. (2016). The parameters of the linear SVM and the partitioning into 75 segments are also adopted from that paper.
Features. Since reliable chapter annotations were not available, each novel has been split into 75 equally sized blocks, called segments. For each lemmatized word, we look up the 11 sentiment values (including polarity, see above). Then, for each segment, we calculate the respective averages, resulting in 11 scores per segment. We group those 11 scores into one feature set.
Qualitative Feature Analysis. As our corpus consists of an equal number of novels with and without a happy ending, the random baseline as well as the majority vote baseline amount to 50% classification accuracy. Since we assumed that the relevant information for identifying happy endings can be found at the end of a novel, we first used the sentiment scores of the final segment ($f_{d,n}$) as the only feature set, reaching an F1-score of 67%.
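The per-segment feature extraction described under Features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy lexicon only contains the three entries from Table 1, the function names are hypothetical, and treating words missing from the lexicon as all-zero entries is an assumption that the text does not specify.

```python
# Order of the 11 sentiment dimensions (2 connotations, polarity, 8 emotions).
DIMENSIONS = ["positive", "negative", "polarity", "anger", "anticipation",
              "disgust", "fear", "joy", "sadness", "surprise", "trust"]

# Toy lexicon with the three example entries from Table 1 (non-zero values only).
TOY_LEXICON = {
    "verabscheuen": {"negative": 1, "anger": 1, "disgust": 1, "fear": 1},
    "bewundernswert": {"positive": 1, "joy": 1, "trust": 1},
    "Zufall": {"surprise": 1},
}


def word_scores(lemma, lexicon):
    """Return the 11 sentiment values of one lemma.

    Assumption: lemmas missing from the lexicon count as all zeros.
    """
    entry = lexicon.get(lemma, {})
    scores = {dim: entry.get(dim, 0) for dim in DIMENSIONS}
    # Polarity is defined as the positive value minus the negative value.
    scores["polarity"] = scores["positive"] - scores["negative"]
    return [scores[dim] for dim in DIMENSIONS]


def split_into_segments(lemmas, n_segments=75):
    """Split a list of lemmas into n_segments blocks of (nearly) equal size."""
    bounds = [i * len(lemmas) // n_segments for i in range(n_segments + 1)]
    return [lemmas[bounds[i]:bounds[i + 1]] for i in range(n_segments)]


def segment_features(lemmas, lexicon, n_segments=75):
    """Average the 11 sentiment values over the words of each segment."""
    features = []
    for segment in split_into_segments(lemmas, n_segments):
        if not segment:
            # Only happens for texts shorter than n_segments words.
            features.append([0.0] * len(DIMENSIONS))
            continue
        rows = [word_scores(lemma, lexicon) for lemma in segment]
        features.append([sum(column) / len(rows) for column in zip(*rows)])
    return features
```

Under these assumptions, `segment_features(lemmas, lexicon)[-1]` is the 11-dimensional score vector of the final segment, which corresponds to the feature set $f_{d,n}$ used in the first experiment.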
Following the intuition that not only the last segment by itself, but also its relation to the rest of the novel is meaningful for the classification, we introduced the notion of sections: the last segment of a novel constitutes the final section, whereas the remaining segments belong to the main section. Averages were also calculated for the sections by taking the mean of each feature over all segments in the section. To further emphasize the relation between these sections, we added the differences between the sentiment scores of the final section and the average sentiment scores over all segments in the main section. However, this change did not influence the results.
This led us to believe that our notion of an “ending” was not accurate enough, as the number of segments for each novel and therefore the boundaries of the final segment have been chosen rather arbitrarily. To approach this issue, we varied the partitioning into main and final section so that the final section can contain more than just the last segment.
Figure 1: Classification F1-score for different partitionings into main and final section. The dashed line represents a random baseline, the dotted line shows where the maximum F1-score is reached.
Figure 1 shows that classification performance improves when at least 75% of the segments are in the main section and reaches a peak at about 95% (this means 4 segments in the final section and 71 segments in the main section, for a total of 75 segments). With this partitioning strategy, we improve the F1-score to 68% using only the feature set for the final section ($f_{d,\text{final}}$) and reach an F1-score of 69% when also including the differences to the average sentiment scores of the main section ($f_{d,\text{main}-\text{final}}$).
Since adding the relation between the main section and the final section improved our results in the previous setting, we tried to model the development of the sentiments towards the end of the novel in a more profound way.
For example, a catastrophic event might happen shortly before the end of a novel and finally be resolved in a happy ending. To capture this intuition, we introduced one more section, namely the late-main section, which focuses on the segments right before the final section, and used the difference between the feature sets for the late-main and the final section as an additional feature set ($f_{d,\text{late}-\text{final}}$). Using those three feature sets, the classification of happy endings reaches an F1-score of 70% and increases to 73% when including the feature set for the final segment.
Table 2: Classification F1-score for the different feature sets
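The section-based feature sets from the Qualitative Feature Analysis can be sketched as follows, starting from a list of per-segment score vectors (75 vectors of 11 values in the paper). This is an illustrative reading of the description, not the original code. The function and key names are hypothetical, the sign convention of the differences (main minus final, late minus final) is inferred from the subscripts, and the size of the late-main section (`n_late`) is not given in the text, so its default here is purely an assumption. The sketch also assumes that the late-main section consists of the last segments of the main section.

```python
def section_mean(segments):
    """Mean of each feature over all segments of a section."""
    return [sum(column) / len(segments) for column in zip(*segments)]


def difference(a, b):
    """Element-wise difference a - b of two feature vectors."""
    return [x - y for x, y in zip(a, b)]


def section_feature_sets(segment_scores, n_final=4, n_late=4):
    """Build the four feature sets from per-segment score vectors.

    n_final=4 follows the best partitioning reported for 75 segments.
    n_late is a hypothetical parameter; its value is not stated in the text.
    """
    if n_final < 1 or n_late < 1 or n_final + n_late > len(segment_scores):
        raise ValueError("sections do not fit into the available segments")

    main = segment_scores[:-n_final]    # everything before the final section
    final = segment_scores[-n_final:]   # the last n_final segments
    late_main = main[-n_late:]          # segments right before the final section

    final_mean = section_mean(final)
    return {
        "last_segment": list(segment_scores[-1]),
        "final": final_mean,
        "main_minus_final": difference(section_mean(main), final_mean),
        "late_minus_final": difference(section_mean(late_main), final_mean),
    }
```

Concatenating the four vectors gives one feature vector per novel, which could then be passed to a classifier; the paper uses a linear SVM with parameters adopted from Zehe et al. (2016), which is not reproduced here.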
Journal: CoRR
Volume: abs/1611.09028
Publication date: 2016